setwd("/Users/yameili/Desktop/ecob2000_lecture1/acs2017_ny")
load("acs2017_ny_data.Rdata")
attach(acs2017_ny)
use_varb <- (AGE >= 25) & (AGE <= 55) & (LABFORCE == 2) & (WKSWORK2 > 4) & (UHRSWORK >= 35)
dat_use <- subset(acs2017_ny,use_varb) #
detach()
attach(dat_use)
This selects people 25-55 (often called prime age), in labor force, working year round, fulltime, We’ll figure out what factors affect income wages, first, we guess these factors can influence the income wages, and then we build a model to test.
model_temp1 <- lm(INCWAGE ~ AGE + female + AfAm + Asian + Amindian + race_oth + Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
summary(model_temp1)
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + AfAm + Asian + Amindian +
## race_oth + Hispanic + educ_hs + educ_somecoll + educ_college +
## educ_advdeg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148088 -33205 -10708 13053 625543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7096.25 2446.71 -2.900 0.003730 **
## AGE 1316.69 39.66 33.199 < 2e-16 ***
## female -24939.46 720.43 -34.617 < 2e-16 ***
## AfAm -11934.26 1130.37 -10.558 < 2e-16 ***
## Asian 566.53 1369.83 0.414 0.679188
## Amindian -8858.57 6077.71 -1.458 0.144971
## race_oth -7526.49 1272.49 -5.915 3.35e-09 ***
## Hispanic -4224.82 1183.47 -3.570 0.000358 ***
## educ_hs 10592.37 1814.71 5.837 5.35e-09 ***
## educ_somecoll 22461.39 1857.67 12.091 < 2e-16 ***
## educ_college 57155.71 1830.96 31.216 < 2e-16 ***
## educ_advdeg 82766.43 1878.64 44.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76760 on 46959 degrees of freedom
## Multiple R-squared: 0.15, Adjusted R-squared: 0.1498
## F-statistic: 753.6 on 11 and 46959 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model_temp1,col="blue",pch=16,cex=1,lwd=1,lty=2)
require(stargazer)
## Loading required package: stargazer
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(model_temp1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## INCWAGE
## -----------------------------------------------
## AGE 1,316.691***
## (39.661)
##
## female -24,939.460***
## (720.433)
##
## AfAm -11,934.250***
## (1,130.372)
##
## Asian 566.528
## (1,369.834)
##
## Amindian -8,858.569
## (6,077.710)
##
## race_oth -7,526.487***
## (1,272.485)
##
## Hispanic -4,224.816***
## (1,183.469)
##
## educ_hs 10,592.370***
## (1,814.709)
##
## educ_somecoll 22,461.390***
## (1,857.674)
##
## educ_college 57,155.710***
## (1,830.963)
##
## educ_advdeg 82,766.430***
## (1,878.638)
##
## Constant -7,096.252***
## (2,446.712)
##
## -----------------------------------------------
## Observations 46,971
## R2 0.150
## Adjusted R2 0.150
## Residual Std. Error 76,755.980 (df = 46959)
## F Statistic 753.551*** (df = 11; 46959)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
from the graph, we can see that the p-value of Asian is 0.679188 and the p-value of Amindian is 0.144971,We can infer that the regression coefficients of these two variables are not significant, so we need to consider whether these two variables are suitable for our model. Besides,the Multiple R-squared is 0.15,This model can only explain the variance of 15%, so we need to optimize it.
Model optimization we need to draw the scatter graph between incomewage and Amindian,incomewageand Asian.
nasian<-as.numeric(as.character(dat_use$INCWAGE))
par(mfrow=c(2,2))
Wage_asian<-lm(INCWAGE~Asian)
plot(Wage_asian,col="purple",pch=14,cex=1,lwd=1,lty=2)
summary(Wage_asian)
##
## Call:
## lm(formula = INCWAGE ~ Asian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76998 -39981 -19981 12019 566019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71980.6 404.6 177.913 < 2e-16 ***
## Asian 5017.1 1286.1 3.901 9.6e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83230 on 46969 degrees of freedom
## Multiple R-squared: 0.0003239, Adjusted R-squared: 0.0003026
## F-statistic: 15.22 on 1 and 46969 DF, p-value: 9.598e-05
nasian<-as.numeric(as.character(dat_use$INCWAGE))
par(mfrow=c(2,2))
Wage_amindian<-lm(INCWAGE~Amindian)
plot(Wage_amindian,col="purple",pch=14,cex=1,lwd=1,lty=2)
summary(Wage_amindian)
##
## Call:
## lm(formula = INCWAGE ~ Amindian)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72553 -40553 -20553 12447 587481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72552.5 384.7 188.586 <2e-16 ***
## Amindian -22033.3 6571.2 -3.353 8e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83240 on 46969 degrees of freedom
## Multiple R-squared: 0.0002393, Adjusted R-squared: 0.000218
## F-statistic: 11.24 on 1 and 46969 DF, p-value: 8e-04
According to the graph, there is no direct relationship between these two variables and incomewage, we should remove these two variables from the model1, and then carry out a stepwise regression,
step(model_temp1)
## Start: AIC=1056708
## INCWAGE ~ AGE + female + AfAm + Asian + Amindian + race_oth +
## Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg
##
## Df Sum of Sq RSS AIC
## - Asian 1 1.0077e+09 2.7666e+14 1056706
## <none> 2.7666e+14 1056708
## - Amindian 1 1.2516e+10 2.7667e+14 1056708
## - Hispanic 1 7.5080e+10 2.7673e+14 1056719
## - educ_hs 1 2.0072e+11 2.7686e+14 1056740
## - race_oth 1 2.0611e+11 2.7686e+14 1056741
## - AfAm 1 6.5671e+11 2.7731e+14 1056817
## - educ_somecoll 1 8.6131e+11 2.7752e+14 1056852
## - educ_college 1 5.7410e+12 2.8240e+14 1057671
## - AGE 1 6.4932e+12 2.8315e+14 1057796
## - female 1 7.0601e+12 2.8372e+14 1057890
## - educ_advdeg 1 1.1435e+13 2.8809e+14 1058608
##
## Step: AIC=1056706
## INCWAGE ~ AGE + female + AfAm + Amindian + race_oth + Hispanic +
## educ_hs + educ_somecoll + educ_college + educ_advdeg
##
## Df Sum of Sq RSS AIC
## <none> 2.7666e+14 1056706
## - Amindian 1 1.2451e+10 2.7667e+14 1056706
## - Hispanic 1 8.9401e+10 2.7675e+14 1056719
## - educ_hs 1 1.9974e+11 2.7686e+14 1056738
## - race_oth 1 2.4610e+11 2.7691e+14 1056746
## - AfAm 1 6.6303e+11 2.7732e+14 1056817
## - educ_somecoll 1 8.6128e+11 2.7752e+14 1056850
## - educ_college 1 5.7430e+12 2.8240e+14 1057669
## - AGE 1 6.4922e+12 2.8315e+14 1057794
## - female 1 7.0610e+12 2.8372e+14 1057888
## - educ_advdeg 1 1.1441e+13 2.8810e+14 1058607
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + AfAm + Amindian + race_oth +
## Hispanic + educ_hs + educ_somecoll + educ_college + educ_advdeg)
##
## Coefficients:
## (Intercept) AGE female AfAm Amindian
## -7004 1316 -24941 -11965 -8835
## race_oth Hispanic educ_hs educ_somecoll educ_college
## -7282 -4378 10547 22409 57128
## educ_advdeg
## 82740
When we did the stepwise regression, the original AIC value was 1056708. In the first regression, we removed the Variable Asian and the AIC value changed to 1056706, which is consistent with our judgment based on the figure above.so we need to delete Asian variable and Amindian varible. Next, we will analyze the impact of each variable on wages. AGE~INCOME WAGE
NNobs <- length(INCWAGE)
set.seed(00000)
graph_obs <- (runif(NNobs) < 0.1)
dat_graph <-subset(dat_use,graph_obs)
plot(INCWAGE ~ jitter(AGE, factor = 2), pch = 19, col = rgb(0.3, 0.4, 0.5, alpha = 0.8), ylim = c(0,150000), data = dat_graph)
to_be_predicted2 <- data.frame(AGE = 25:55, female = 1, AfAm = 0, Asian = 0, Amindian = 1, race_oth = 1, Hispanic = 1, educ_hs = 0, educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
to_be_predicted2$yhat <- predict(model_temp1, newdata = to_be_predicted2)
lines(yhat ~ AGE, data = to_be_predicted2, col="red",cex=2,lwd=2,lty=2)
nage<-as.numeric(as.character(dat_use$INCWAGE))
par(mfrow=c(2,2))
Wage_age<-lm(INCWAGE~AGE)
plot(Wage_age,col="pink",pch=20,cex=1,lwd=1,lty=1)
summary(Wage_age)
##
## Call:
## lm(formula = INCWAGE ~ AGE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86853 -38985 -18021 12041 580882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32338.6 1747.3 18.51 <2e-16 ***
## AGE 991.2 42.1 23.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82760 on 46969 degrees of freedom
## Multiple R-squared: 0.01166, Adjusted R-squared: 0.01164
## F-statistic: 554.2 on 1 and 46969 DF, p-value: < 2.2e-16
As can be seen from the figure, age has positively correlated with income wage. From the age of 25 to 55, the income increases with the increase of age. In addition, the income of different age groups is concentrated between 3000 and 10000, and the number of people with income reaching 150,000 increases from the age of 40.
To improve the accuracy, we performed polynomial regression on the independent variable of age.
nage<-as.numeric(as.character(dat_use$INCWAGE))
par(mfrow=c(2,2))
Wage_age2<-lm(INCWAGE~AGE+I(AGE^2),data=dat_use)
plot(Wage_age2,col="pink",pch=20,cex=1,lwd=1,lty=1)
summary(Wage_age2)
##
## Call:
## lm(formula = INCWAGE ~ AGE + I(AGE^2), data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82216 -39552 -17216 12401 593941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004e+05 8.440e+03 -11.90 <2e-16 ***
## AGE 7.932e+03 4.338e+02 18.28 <2e-16 ***
## I(AGE^2) -8.612e+01 5.358e+00 -16.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82530 on 46968 degrees of freedom
## Multiple R-squared: 0.01707, Adjusted R-squared: 0.01703
## F-statistic: 407.8 on 2 and 46968 DF, p-value: < 2.2e-16
After polynomial regression, we can see that the accuracy has been improved, indicating that age has a great influence on salary.
female~INCOME WAGE
str(female)
## num [1:46971] 0 1 0 1 1 0 0 0 1 0 ...
model_temp2 <- lm(INCWAGE ~ female, data = dat_use)
par(mfrow=c(2,2))
plot(model_temp2, col="gray")
summary(model_temp2)
##
## Call:
## lm(formula = INCWAGE ~ female, data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80964 -39212 -18212 12788 575788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80963.9 515.9 156.95 <2e-16 ***
## female -18752.3 766.8 -24.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82720 on 46969 degrees of freedom
## Multiple R-squared: 0.01257, Adjusted R-squared: 0.01255
## F-statistic: 598 on 1 and 46969 DF, p-value: < 2.2e-16
F test:p-value: < 2.2e-16, There is a relationship between gender and income.
SEX_f <- factor(dat_use$SEX)
summary(SEX_f)
## Male Female
## 25713 21258
boxplot(dat_use$INCWAGE~dat_use$SEX,data=dat_use,col="red",main="wage~sex",xlab="income",ylab="sex",horizontal = TRUE)
we can see that the mean of male’s wage greater than the mean of female’s wage,and for the number of max wage,male are more than the female.In high-income groups, men exceed women.
model_temp3 <- lm(INCWAGE ~educ_advdeg, data = dat_use)
par(mfrow=c(2,2))
plot(model_temp3,col="green")
summary(model_temp3)
##
## Call:
## lm(formula = INCWAGE ~ educ_advdeg, data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113149 -36801 -16801 12199 576199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61801.0 417.8 147.91 <2e-16 ***
## educ_advdeg 51347.7 916.4 56.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80600 on 46969 degrees of freedom
## Multiple R-squared: 0.06266, Adjusted R-squared: 0.06264
## F-statistic: 3140 on 1 and 46969 DF, p-value: < 2.2e-16
model_temp4 <- lm(INCWAGE ~educ_advdeg, data = dat_use)
par(mfrow=c(2,2))
plot(model_temp4,col="green")
summary(model_temp4)
##
## Call:
## lm(formula = INCWAGE ~ educ_advdeg, data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113149 -36801 -16801 12199 576199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61801.0 417.8 147.91 <2e-16 ***
## educ_advdeg 51347.7 916.4 56.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80600 on 46969 degrees of freedom
## Multiple R-squared: 0.06266, Adjusted R-squared: 0.06264
## F-statistic: 3140 on 1 and 46969 DF, p-value: < 2.2e-16
model_temp5 <- lm(INCWAGE ~educ_advdeg+I(educ_advdeg^2), data = dat_use)
par(mfrow=c(2,2))
plot(model_temp5,col="green")
summary(model_temp5)
##
## Call:
## lm(formula = INCWAGE ~ educ_advdeg + I(educ_advdeg^2), data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113149 -36801 -16801 12199 576199
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61801.0 417.8 147.91 <2e-16 ***
## educ_advdeg 51347.7 916.4 56.03 <2e-16 ***
## I(educ_advdeg^2) NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80600 on 46969 degrees of freedom
## Multiple R-squared: 0.06266, Adjusted R-squared: 0.06264
## F-statistic: 3140 on 1 and 46969 DF, p-value: < 2.2e-16
new.model<-lm(INCWAGE ~AGE+ female + educ_somecoll + educ_college + educ_advdeg,data = dat_use)
par(mfrow=c(2,2))
plot(new.model,col="gold",pch=20,cex=1,lwd=1,lty=1)
require(stargazer)
stargazer(new.model, type = "text")
##
## ================================================
## Dependent variable:
## ----------------------------
## INCWAGE
## ------------------------------------------------
## AGE 1,343.870***
## (39.636)
##
## female -25,581.830***
## (719.294)
##
## educ_somecoll 14,406.760***
## (1,010.572)
##
## educ_college 49,929.340***
## (946.274)
##
## educ_advdeg 75,991.160***
## (1,020.901)
##
## Constant -3,048.084*
## (1,809.962)
##
## ------------------------------------------------
## Observations 46,971
## R2 0.146
## Adjusted R2 0.146
## Residual Std. Error 76,936.290 (df = 46965)
## F Statistic 1,604.883*** (df = 5; 46965)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
lm1<-lm(INCWAGE ~AGE+ female + educ_somecoll + educ_college + educ_advdeg,data = dat_use)
summary(lm1)
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + educ_somecoll + educ_college +
## educ_advdeg, data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146856 -33301 -10864 13016 618251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3048.08 1809.96 -1.684 0.0922 .
## AGE 1343.87 39.64 33.905 <2e-16 ***
## female -25581.84 719.29 -35.565 <2e-16 ***
## educ_somecoll 14406.76 1010.57 14.256 <2e-16 ***
## educ_college 49929.34 946.27 52.764 <2e-16 ***
## educ_advdeg 75991.16 1020.90 74.435 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76940 on 46965 degrees of freedom
## Multiple R-squared: 0.1459, Adjusted R-squared: 0.1458
## F-statistic: 1605 on 5 and 46965 DF, p-value: < 2.2e-16
y=-3048.08+1343.87X1-25581.84X2+14406.76X3+49929.34X4+75991.16X5
par(mfrow=c(2,2))
plot(lm1,col="orange",pch=5,cex=1)
lm2<-update(lm1,.~. -female)
summary(lm2)
##
## Call:
## lm(formula = INCWAGE ~ AGE + educ_somecoll + educ_college + educ_advdeg,
## data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132859 -33770 -12018 11591 614027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12161.04 1815.68 -6.698 2.14e-11 ***
## AGE 1338.31 40.17 33.320 < 2e-16 ***
## educ_somecoll 11922.21 1021.63 11.670 < 2e-16 ***
## educ_college 47168.94 955.69 49.356 < 2e-16 ***
## educ_advdeg 71413.33 1026.29 69.584 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77960 on 46966 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1228
## F-statistic: 1646 on 4 and 46966 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm2,col="orange")
lm3<-update(lm1,.~. -female-AGE)
summary(lm3)
##
## Call:
## lm(formula = INCWAGE ~ educ_somecoll + educ_college + educ_advdeg,
## data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113149 -34149 -14303 11697 593697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44303.4 659.5 67.176 <2e-16 ***
## educ_somecoll 10210.3 1032.3 9.891 <2e-16 ***
## educ_college 42215.1 955.1 44.198 <2e-16 ***
## educ_advdeg 68845.3 1035.4 66.491 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78880 on 46967 degrees of freedom
## Multiple R-squared: 0.1022, Adjusted R-squared: 0.1021
## F-statistic: 1782 on 3 and 46967 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm3,col="orange")
lm4<-update(lm1,.~. +female*AGE)
summary(lm4)
##
## Call:
## lm(formula = INCWAGE ~ AGE + female + educ_somecoll + educ_college +
## educ_advdeg + AGE:female, data = dat_use)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150330 -33605 -10534 12810 616856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13228.61 2325.45 -5.689 1.29e-08 ***
## AGE 1596.66 53.72 29.722 < 2e-16 ***
## female -3407.80 3262.77 -1.044 0.296
## educ_somecoll 14473.40 1010.11 14.329 < 2e-16 ***
## educ_college 49755.79 946.12 52.589 < 2e-16 ***
## educ_advdeg 75742.26 1021.01 74.184 < 2e-16 ***
## AGE:female -547.20 78.54 -6.967 3.27e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76900 on 46964 degrees of freedom
## Multiple R-squared: 0.1468, Adjusted R-squared: 0.1467
## F-statistic: 1347 on 6 and 46964 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm4,col="orange")
new<-data.frame(female=0,AGE = 35,educ_somecoll = 0, educ_college = 1, educ_advdeg = 0)
lm.pred<-predict(lm1,new,interval="prediction",level=0.95)
lm.pred
## fit lwr upr
## 1 93916.72 -56887 244720.5
we can predict a man who has a college educational level, at age 35, his income interval=(-56887,244720.5) at the 95% confidence level.
In conclusion. Age, gender and education background are both factors affecting income, among which the average income of males is greater than that of females. Moreover, in high-income groups, the number of males is greater than that of females. For education background, the higher education background is, the higher income is, and the income of people with a graduate degree is much higher than that of people with a high school education.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.